AN ANOMALY IN THE PEARSON PRODUCT-MOMENT
CORRELATION COEFFICIENT


When the range of one or both variables in a simple Pearson product-moment
correlation is artificially restricted, the absolute value of the sample
correlation coefficient is reduced below that of the true (population)
coefficient. This phenomenon is responsible for the well-known difficulty
of finding useful correlations between grades and other indices of
intellectual performance in such preselected populations as college students,
especially graduate students (cf. Wallach, 1976).

Restriction of range is expressed relative to the population, with the
most widely accepted index being the ratio of the sample standard deviation
to the population standard deviation (Ghiselli, 1964; Guilford, 1954).
Ghiselli has presented the mathematical proofs which establish the required
correction for attenuation when the population variance is known or can be
estimated accurately. In the limit, as the ratio of the sample standard
deviation to the population standard deviation approaches zero, the Pearson
correlation coefficient approaches zero.
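
The limiting behavior described above can be made concrete with the form of
the range-restriction relation commonly given in textbooks for direct
selection on one variable. The symbols are assumptions introduced here and
may not match Ghiselli's notation: R is the population coefficient, r the
coefficient in the restricted sample, and u = s/S the ratio of the sample to
the population standard deviation of the restricted variable. The relation
also assumes a linear regression with equal conditional variances, which is
exactly the assumption the data discussed later fail to meet.

```latex
% Restricted-sample coefficient implied by R and u = s/S
r = \frac{R\,u}{\sqrt{1 - R^{2} + R^{2}u^{2}}}

% Solving the same relation for R gives the correction
R = \frac{r/u}{\sqrt{1 - r^{2} + r^{2}/u^{2}}}

% For fixed |R| < 1 the numerator vanishes while the denominator
% tends to \sqrt{1 - R^{2}} > 0, hence
\lim_{u \to 0} r = 0
```

Note that the limit holds only for |R| < 1; when R = 1 the relation gives
r = 1 for every u > 0.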

What is rarely made clear in classroom discussions is that correlation
is a function not merely of the marginal distributions but of the conditional
distributions as well. Perhaps for this reason, there appears to be a
widespread misunderstanding of this phenomenon, leading to the belief (until
recently shared by the authors) that any variable with a variance near zero
would necessarily show a correlation coefficient near zero with any other
variable, and that decreasing variance necessarily leads to decreasing
absolute values of coefficients. It therefore seems useful to describe an
anomaly recently encountered in our research.

The Anomaly

During the course of research on the development of flying training
objectives for the U.S. Air Force (Barron, Gerlach & Haygood, 1976), we had
occasion to collect ratings of the "observability" of training objectives
using a standard 1 to 5 rating scale. In addition to ratings of complete
statements of objectives, we also collected ratings of the three component
parts of such statements: the condition, verb, and criterion (e.g., "given
a pair of 9-digit numbers/add the two numbers/without error"). For a
surprising number of statement components, there was near-universal agreement
among the subjects participating, generating data such as those shown in
Table 1. For example, all subjects but one rated "with 10% accuracy" as
being highly observable (a 1 on the rating scale). On scanning the data,
we were confident that the computed correlation coefficient for Table 1
would be near zero because of the small variance; the discovery of a
coefficient of +1.00 came as a distinct surprise.

In  retrospect,  it  is  clear  why  the  coefficient  must  be  1.00:  a  straight 
line  will  fit  the  data  perfectly,  and  increases  or  decreases  in  the  variance 
resulting  from  the  addition  or  elimination  of  pairs  falling  on  either  data 
point  (1,1  or  2,3)  will  not  change  the  data  pattern.  The  correlation 
coefficients  computed  for  varying  numbers  of  1,1  pairs  and  extra  selected 
data  points  are  shown  in  Table  2.  Of  special  interest  is  the  fact  that  addition 
of  extra  subjects  at  data  point  1,1  decreases  the  variance  but  increases  the 
correlation  coefficient  for  the  last  three  rows  of  Table  2.  The  actual 
coefficients  in  the  last  three  rows  are  dependent  on  the  arbitrary  selection 
of  the  extra  data  points,  but  the  pattern  is  the  same  regardless  of  which  extra 
points  are  chosen. 
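
The computation behind these statements can be checked with a short script.
What follows is a minimal sketch, not the authors' procedure: the names
pearson_r, table2_cell and EXTRA_PAIRS are hypothetical, and the order in
which the extra pairs are added is read off the row labels of Table 2. The
sketch returns None when either variable has zero variance, since the
coefficient is then undefined; the table lists that row as .00.

```python
from math import sqrt


def pearson_r(pairs):
    """Pearson r for a list of (x, y) pairs; None if either variance is zero."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return None  # r is undefined when a variable does not vary
    return sxy / sqrt(sxx * syy)


# Extra pairs in the order suggested by the row labels of Table 2.
EXTRA_PAIRS = [(2, 3), (2, 2), (1, 2), (2, 1)]


def table2_cell(n_ones, n_extra):
    """r for n_ones pairs at (1, 1) plus the first n_extra extra pairs."""
    return pearson_r([(1, 1)] * n_ones + EXTRA_PAIRS[:n_extra])


# Table 1: nineteen (1, 1) pairs and a single (2, 3) pair.
table1_r = table2_cell(19, 1)
print(f"Table 1: r = {table1_r:.2f}")

# Rows 1 to 4 of Table 2, one column per number of (1, 1) pairs.
for n_extra in range(1, 5):
    row = [table2_cell(n, n_extra) for n in (1, 5, 10, 100, 1000)]
    print(n_extra, " ".join(f"{r:.2f}" for r in row))
```

With only the (1,1) and (2,3) points present, every pair satisfies
y = 2x - 1 exactly, so r stays at 1.00 however many (1,1) pairs are added,
even though both variances shrink toward zero. In the remaining rows the
printed values should agree with Table 2 to within rounding, and each row
rises from left to right.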

Table 1

Pairs of Scores Found in Rating-Scale Research Data (a)

Subject No.    Rating on Scale A    Rating on Scale B

     1                 1                    1
     2                 1                    1
     3                 1                    1
     4                 1                    1
     5                 1                    1
     6                 1                    1
     7                 1                    1
     8                 1                    1
     9                 1                    1
    10                 1                    1
    11                 1                    1
    12                 1                    1
    13                 1                    1
    14                 2                    3
    15                 1                    1
    16                 1                    1
    17                 1                    1
    18                 1                    1
    19                 1                    1
    20                 1                    1

(a) rAB = 1.00

Table 2

Pearson Correlation Coefficients for Selected Combinations of Scores

Number of      Actual Score               Number of 1,1 pairs
Extra Pairs    Pairs                1 (a)     5      10     100    1000

     0         -                     .00     .00    .00     .00     .00
     1         2,3                  1.00    1.00   1.00    1.00    1.00
     2         2,3/2,2               .87     .93    .94     .95     .95
     3         2,3/2,2/1,2           .71     .82    .84     .86     .87
     4         2,3/2,2/1,2/2,1       .32     .57    .63     .70     .71

(a) All non-zero coefficients are positive.


Conclusion 


When  small  numbers  of  subjects  are  used  with  discrete  rating  scales  of 
limited  range,  it  seems  likely  that  data  patterns  of  the  type  shown  here 
will  often  be  found.  If  all  subjects  use  the  same  category,  that  rating 
necessarily  correlates  zero  with  any  other  scale.  If  one  or  two  subjects 
choose  divergent  responses  on  each  of  two  scales,  the  correlation  may  be 
quite  high,  even  perfect,  regardless  of  the  variances  obtained.  Thus,  the 
belief  that  decreased  variance  in  one  or  both  variables  is  always  associated 
with  reduced  correlation  is  clearly  incorrect.  As  an  instructional  note, 
we  recommend  that  classroom  presentations  of  the  restriction  of  range  problem 
be  structured  to  avoid  leaving  such  an  impression  with  the  student.  While 
the  data  analyst  himself  is  not  likely  to  be  misled,  the  risk  is  that  those 
who  do  not  have  access  to  the  raw  data  may  take  such  correlations  at  face 
value  and  assume  great  predictive  power  where  none  exists. 